sample selection bias
Robust Classification Under Sample Selection Bias
In many important machine learning applications, the source distribution used to estimate a probabilistic classifier differs from the target distribution on which the classifier will be used to make predictions. Due to its asymptotic properties, sample reweighted empirical loss minimization is a commonly employed technique to deal with this difference. However, given finite amounts of labeled source data, this technique suffers from significant estimation errors in settings with large sample selection bias. We develop a framework for learning a robust bias-aware (RBA) probabilistic classifier that adapts to different sample selection biases using a minimax estimation formulation. Our approach requires only accurate estimates of statistics under the source distribution and is otherwise as robust as possible to unknown properties of the conditional label distribution, except when explicit generalization assumptions are incorporated. We demonstrate the behavior and effectiveness of our approach on binary classification tasks.
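The sample-reweighted baseline that this abstract contrasts against can be sketched in a few lines. A minimal illustration (the function names and the logistic setup are mine, not from the paper): each source example's log-loss is scaled by the density ratio p_target(x)/p_source(x), and large ratios under strong selection bias are exactly what inflates the estimation error the abstract mentions.

```python
import math

def importance_weighted_log_loss(xs, ys, w_vec, src_density, tgt_density):
    """Sample-reweighted empirical log-loss: each source example is
    weighted by the density ratio p_target(x) / p_source(x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        ratio = tgt_density(x) / src_density(x)      # importance weight
        score = sum(wi * xi for wi, xi in zip(w_vec, x))
        p = 1.0 / (1.0 + math.exp(-score))           # logistic model
        total += ratio * -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)
```

When the source and target densities coincide, every ratio is 1 and this reduces to ordinary empirical risk minimization.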
Alleviating the Sample Selection Bias in Few-shot Learning by Removing Projection to the Centroid
Despite the emergence of a number of few-shot learning methods, the sample selection bias problem, i.e., the sensitivity to the limited amount of support data, has not been well understood. In this paper, we find that this problem usually occurs when the positions of support samples are in the vicinity of the task centroid, i.e., the mean of all class centroids in the task. This motivates us to propose an extremely simple feature transformation to alleviate this problem, dubbed Task Centroid Projection Removing (TCPR). TCPR is applied directly to all image features in a given task, aiming at removing the dimension of features along the direction of the task centroid. While the exact task centroid cannot be accurately obtained from limited data, we estimate it using base features that are each similar to one of the support features.
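The transformation described above is a plain orthogonal-projection removal, which makes it easy to sketch. A minimal version (names are mine; the paper applies this to image features with an estimated centroid):

```python
def remove_centroid_projection(feature, task_centroid):
    """TCPR-style transform (sketch): subtract the component of a
    feature vector that lies along the task-centroid direction."""
    norm_sq = sum(c * c for c in task_centroid)
    if norm_sq == 0.0:
        return list(feature)
    scale = sum(f * c for f, c in zip(feature, task_centroid)) / norm_sq
    return [f - scale * c for f, c in zip(feature, task_centroid)]
```

The output is orthogonal to the centroid direction, so the coordinate along that direction is removed while all others are preserved.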
A Semi-supervised CART Model for Covariate Shift
Cai, Mingyang, Klausch, Thomas, van de Wiel, Mark A.
Machine learning models used in medical applications often face challenges due to covariate shift, which occurs when the distributions of the training and target data differ. This can reduce predictive accuracy, especially when outcomes in the target data are unknown. This paper introduces a semi-supervised classification and regression tree (CART) that uses importance weighting to address these distribution discrepancies. Our method improves the predictive performance of the CART model by assigning greater weights to training samples that better represent the target distribution, especially under covariate shift without target outcomes. Beyond CART, we extend this weighted approach to generalized linear model trees and tree ensembles, creating a versatile framework for managing covariate shift in complex datasets. Through simulation studies and applications to real-world medical data, we demonstrate significant improvements in predictive accuracy. These findings suggest that our weighted approach can enhance reliability in medical applications and other fields where covariate shift degrades model performance across data distributions.
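A hedged sketch of the importance-weighting step: assuming, purely for illustration, that both covariate distributions are modelled as 1-D Gaussians, the per-sample weight is the density ratio p_target(x)/p_source(x), which would then be passed to the tree learner as a sample weight. The function names and the Gaussian model are mine, not the paper's.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def covariate_shift_weights(train_x, src_mean, src_std, tgt_mean, tgt_std):
    """Importance weights p_target(x) / p_source(x) under a hypothetical
    Gaussian model of both covariate distributions; these weights would
    be handed to the tree learner as per-sample weights."""
    return [gaussian_pdf(x, tgt_mean, tgt_std) / gaussian_pdf(x, src_mean, src_std)
            for x in train_x]
```

If source and target distributions coincide, every weight is 1 and the weighted tree reduces to an ordinary CART fit; samples lying where the target density exceeds the source density get up-weighted.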
Contextual Representation Anchor Network to Alleviate Selection Bias in Few-Shot Drug Discovery
Li, Ruifeng, Liu, Wei, Zhou, Xiangxin, Li, Mingqian, Zhang, Qiang, Chen, Hongyang, Lin, Xuemin
In the drug discovery process, the low success rate of drug candidate screening often leads to insufficient labeled data, causing the few-shot learning problem in molecular property prediction. Existing methods for few-shot molecular property prediction overlook sample selection bias, which arises from non-random sample selection in chemical experiments. This bias in data representativeness leads to suboptimal performance. To overcome this challenge, we present a novel method named Contextual Representation Anchor Network (CRA), where an anchor refers to a cluster center of the representations of molecules and serves as a bridge that transfers enriched contextual knowledge into molecular representations and enhances their expressiveness. CRA introduces a dual-augmentation mechanism: context augmentation, which dynamically retrieves analogous unlabeled molecules and captures their task-specific contextual knowledge to enhance the anchors, and anchor augmentation, which leverages the anchors to augment the molecular representations. We evaluate our approach on the MoleculeNet and FS-Mol benchmarks, as well as in domain transfer experiments. The results demonstrate that CRA outperforms the state of the art by 2.60% and 3.28% in AUC and $\Delta$AUC-PR, respectively, and exhibits superior generalization capabilities.
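The anchor idea can be sketched minimally. Assuming anchors are given as cluster centres and using a simple convex blend as the augmentation rule (the blend and `alpha` are illustrative assumptions, not the paper's exact mechanism):

```python
def nearest_anchor(rep, anchors):
    """Return the anchor (cluster centre) closest to a representation."""
    return min(anchors, key=lambda a: sum((r - ai) ** 2 for r, ai in zip(rep, a)))

def anchor_augment(rep, anchors, alpha=0.5):
    """Anchor augmentation (sketch): blend a molecular representation
    with its nearest anchor to inject shared contextual knowledge.
    `alpha` controls how far the representation moves toward the anchor."""
    a = nearest_anchor(rep, anchors)
    return [(1 - alpha) * r + alpha * ai for r, ai in zip(rep, a)]
```

Pulling each representation toward a shared cluster centre is one way to counteract unrepresentative sampling: representations from a biased support set inherit context from the wider cluster.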
Sample Selection Bias in Machine Learning for Healthcare
Chauhan, Vinod Kumar, Clifton, Lei, Salaün, Achille, Lu, Huiqi Yvonne, Branson, Kim, Schwab, Patrick, Nigam, Gaurav, Clifton, David A.
While machine learning algorithms hold promise for personalised medicine, their clinical adoption remains limited. One critical factor contributing to this is sample selection bias (SSB), which refers to the study population being less representative of the target population, leading to biased and potentially harmful decisions. Despite being well known in the literature, SSB remains scarcely studied in machine learning for healthcare. Moreover, existing techniques try to correct the bias by balancing distributions between the study and target populations, which may result in a loss of predictive performance. To address these problems, our study illustrates the potential risks associated with SSB by examining its impact on the performance of machine learning algorithms. Most importantly, we propose a new research direction for addressing SSB based on identifying the target subpopulation rather than correcting the bias. Specifically, we propose two independent networks (T-Net) and a multitasking network (MT-Net), where one network/task identifies the target subpopulation that is representative of the study population and the second makes predictions for that subpopulation. Our empirical results on synthetic and semi-synthetic datasets highlight that SSB can cause a large drop in an algorithm's performance on the target population compared with the study population, as well as a substantial performance difference between target subpopulations representative of the selected and non-selected patients from the study population. Furthermore, our proposed techniques demonstrate robustness across various settings, including different dataset sizes, event rates, and selection rates, outperforming existing bias correction techniques.
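The two-stage idea, identify the supported subpopulation first and predict second, can be sketched as a small pipeline (a simplification of the T-Net setup; the threshold and the abstention rule are my assumptions):

```python
def tnet_predict(x, membership_model, outcome_model, threshold=0.5):
    """T-Net-style pipeline (sketch): one model scores whether a target
    individual resembles the study population; the outcome model only
    predicts for individuals above the threshold and abstains otherwise."""
    if membership_model(x) >= threshold:
        return outcome_model(x)
    return None  # outside the supported subpopulation: no prediction
```

Abstaining on unsupported individuals is what distinguishes this direction from reweighting: instead of stretching the study-population model to cover everyone, it restricts predictions to where the model is representative.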
Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising
In search advertising, the search engine needs to select the most profitable advertisements to display, which can be formulated as an instance of online learning with partial feedback, also known as the stochastic multi-armed bandit (MAB) problem. In this paper, we show that the naive application of MAB algorithms to search advertising for advertisement selection will produce sample selection bias that harms the search engine by decreasing expected revenue and "estimation of the largest mean" (ELM) bias that harms the advertisers by increasing game-theoretic player-regret. We then propose simple bias-correction methods with benefits to both the search engine and the advertisers.
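The ELM bias is easy to demonstrate by simulation: even when every arm has the same true mean, the empirical mean of the best-looking arm is biased upward, because the maximum of several noisy estimates systematically overshoots. A small Monte-Carlo sketch (all parameters are illustrative):

```python
import random

def elm_bias(n_arms=5, n_pulls=10, n_trials=2000, seed=0):
    """Monte-Carlo illustration of 'estimation of the largest mean'
    (ELM) bias: with equal true means of 0, the empirical mean of the
    arm that looks best systematically overestimates the truth."""
    rng = random.Random(seed)
    overshoot = 0.0
    for _ in range(n_trials):
        means = [sum(rng.gauss(0.0, 1.0) for _ in range(n_pulls)) / n_pulls
                 for _ in range(n_arms)]
        overshoot += max(means)          # true mean of every arm is 0
    return overshoot / n_trials
```

The returned value is clearly positive even though the true mean of every arm is zero; this is the winner's-curse effect that inflates an advertiser's apparent value when its ad is selected by a bandit algorithm.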
Leveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection Bias
Odonnat, Ambroise, Feofanov, Vasilii, Redko, Ievgen
Self-training is a well-known approach for semi-supervised learning. It consists of iteratively assigning pseudo-labels to unlabeled data for which the model is confident and treating them as labeled examples. For neural networks, softmax prediction probabilities are often used as a confidence measure, despite the fact that they are known to be overconfident, even for wrong predictions. This phenomenon is particularly intensified in the presence of sample selection bias, i.e., when data labeling is subject to some constraint. To address this issue, we propose a novel confidence measure, called $\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of linear classifiers. We provide a theoretical analysis of our approach by studying stationary points and describing the relationship between the diversity of the individual members and their performance. We empirically demonstrate the benefit of our confidence measure for three different pseudo-labeling policies on classification datasets of various data modalities.
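One natural instantiation of a diversity-based confidence measure, sketched here as the average pairwise dot product of the ensemble members' class-probability vectors (an illustrative reading of the abstract, not necessarily the paper's exact definition of $\mathcal{T}$-similarity):

```python
def t_similarity(prob_vectors):
    """Sketch of a diversity-based confidence: average pairwise dot
    product of the class-probability vectors produced by an ensemble
    of classifiers; disagreement between members lowers the score."""
    m = len(prob_vectors)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                total += sum(p * q for p, q in zip(prob_vectors[i], prob_vectors[j]))
    return total / (m * (m - 1))
```

Unlike a single softmax confidence, this score is high only when the members agree: identical confident predictions give 1, while members that pick different classes drive it toward 0.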
Synthia's Melody: A Benchmark Framework for Unsupervised Domain Adaptation in Audio
Lin, Chia-Hsin, Jones, Charles, Schuller, Björn W., Coppock, Harry
Despite significant advancements in deep learning for vision and natural language, unsupervised domain adaptation in audio remains relatively unexplored. We, in part, attribute this to the lack of an appropriate benchmark dataset. To address this gap, we present Synthia's Melody, a novel audio data generation framework capable of simulating an infinite variety of 4-second melodies with user-specified confounding structures characterised by musical keys, timbre, and loudness. Unlike existing datasets collected under observational settings, Synthia's Melody is free of unobserved biases, ensuring the reproducibility and comparability of experiments. To showcase its utility, we generate two types of distribution shift, domain shift and sample selection bias, and evaluate the performance of acoustic deep learning models under these shifts. Our evaluations reveal that Synthia's Melody provides a robust testbed for examining the susceptibility of these models to varying levels of distribution shift.